feat(rust): Add RLE to RLE_DICTIONARY encoder #15959

Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15959      +/-   ##
==========================================
- Coverage   81.29%   80.93%   -0.36%
==========================================
  Files        1381     1385       +4
  Lines      176876   178291    +1415
  Branches     3043     3050       +7
==========================================
+ Hits       143789   144307     +518
- Misses      32604    33493     +889
- Partials      483      491       +8

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Hmm.. Do we actually compress those v1 pages then? 🤔
I don't think we should activate it for nested types then.
Thank you for the PR. Interesting improvement. I've left some comments.
    Ok(())
}

#[allow(clippy::comparison_chain)]
pub fn encode_bool<W: Write, I: Iterator<Item = bool>>(
This is the same logic as encode_u32; can we make a single generic function and dispatch this one to it?
I'm very new to Rust, and it took me a while to figure this out. I'm not sure if I implemented this correctly.
Yup, here's the relevant code: polars/crates/polars-parquet/src/parquet/write/compression.rs, lines 23–38 at 864e750.
That makes sense. I'd like to revisit this later, but I think I'll need to learn some more Rust before tackling that problem. Thank you for the review!
Ok, I think there is value in this, so I will pick up those points and get this in.
Thank you for the PR @thalassemia |
This reverts commit 6730a72.
We had to revert this because of #16109. I do think these changes were interesting, so if you or anyone else has time to find the cause, that'd be appreciated.
@ritchie46 So sorry for this and thank you @nameexhaustion for fixing it! I should've written a test with more random, realistic data. Since this feature still sounds worthwhile, I'll look into this and include more robust tests for my next PR.
The `RLE_DICTIONARY` encoder currently only supports bit-packing. After this PR, the encoder will switch between bit-packing and RLE depending on whether a run is longer than 8 (a multiple of 8 values is always bit-packed at a time). This addresses the specific case pointed out in #10680. With these changes, I get 2745 bytes for `pl_zstd.pq` and 2962 bytes for `pa_zstd.pq` after running the following code:
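The code block itself did not survive this page export. Below is a minimal sketch of such a comparison, assuming a low-cardinality column like the one in #10680; the DataFrame contents and the resulting sizes are illustrative stand-ins, not the author's exact benchmark. Only `pl_zstd.pq`, `pa_zstd.pq`, zstd compression, and `use_pyarrow=True` come from the discussion.

```python
import os

import polars as pl

# Hypothetical low-cardinality column of the kind discussed in #10680.
df = pl.DataFrame({"x": ["a", "b", "c"] * 100_000})

# Polars' native Rust writer, which uses the dictionary encoder touched by this PR.
df.write_parquet("pl_zstd.pq", compression="zstd")

# PyArrow writer, used as the reference point.
df.write_parquet("pa_zstd.pq", compression="zstd", use_pyarrow=True)

print(os.path.getsize("pl_zstd.pq"), os.path.getsize("pa_zstd.pq"))
```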
At first, the additional logic did slow down the encoder significantly. To address this, I did some profiling and made two optimizations. The encoder took `u32` keys and downcasted them to `u16` if possible; I removed this downcasting step because it seemed unnecessary (f7e803a).
After these optimizations, the `RLE_DICTIONARY` encoder performs on par with or only slightly worse than `use_pyarrow=True` in most cases (faster than the current encoding performance). However, performance can be up to 50% worse than `use_pyarrow=True` (matching the current encoding performance) with high-cardinality data that frequently switches between RLE and bit-packing:
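To make the switching behaviour concrete, here is a minimal Python sketch of the run/bit-pack decision described above. It is not the PR's Rust code, and it ignores the multiple-of-8 grouping the real hybrid-RLE format requires; it only shows why data with many short runs keeps flipping between the two modes.

```python
from itertools import groupby

def segment_keys(keys):
    """Split dictionary keys into RLE runs and bit-packed groups using the
    simplified rule from the PR description: runs longer than 8 values are
    RLE-encoded, everything else accumulates into a bit-packed group."""
    chunks = []
    bitpack = []
    for value, group in groupby(keys):
        run = list(group)
        if len(run) > 8:
            if bitpack:
                chunks.append(("bitpack", len(bitpack)))
                bitpack = []
            chunks.append(("rle", value, len(run)))
        else:
            bitpack.extend(run)
    if bitpack:
        chunks.append(("bitpack", len(bitpack)))
    return chunks

# Long runs collapse into a couple of cheap RLE chunks.
print(segment_keys([0] * 1_000 + [1] * 1_000))

# Alternating long and short runs force the encoder to flip between RLE and
# bit-packing on almost every run, which is the slow case described above.
mixed = []
for i in range(500):
    mixed.extend([i] * (12 if i % 2 == 0 else 3))
print(len(segment_keys(mixed)))
```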
Some other changes:

- Nested columns no longer use `RLE_DICTIONARY` encoding (9c73a61). For some reason, dictionary arrays do not scale well for large nested columns, and I cannot figure out what is causing this. For example, see the sketch at the end of this description.

This does not fully resolve the linked issue, because it still does not provide users any way of manually specifying encodings like in PyArrow.
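The nested-column example also did not survive the export. A minimal sketch of the kind of column being described, with made-up shapes and file name, might look like the following; it only illustrates the data shape, not the scaling problem itself.

```python
import os

import polars as pl

# Hypothetical large nested (list) column; shapes and file name are illustrative.
df = pl.DataFrame({"nested": [list(range(20))] * 200_000})

df.write_parquet("nested_zstd.pq", compression="zstd")
print(os.path.getsize("nested_zstd.pq"))
```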